Abstract

Soft real-time applications such as multimedia applications often show bursty
memory access patterns, regularly requiring high memory bandwidth for a short
duration of time. Such a period is often critical for timely data processing. Hence,
we call it a memory-performance critical section. Unfortunately, in multicore
architectures, non-real-time applications on different cores may also demand high
memory bandwidth at the same time, which can substantially increase the time
spent in the memory-performance critical sections.

In this paper, we present BWLOCK, user-level APIs and a memory bandwidth
control mechanism that can protect such memory-performance critical sections
of soft real-time applications. BWLOCK provides simple lock-like APIs to
declare memory-performance critical sections. When an application enters a
memory-performance critical section, the memory bandwidth control system
dynamically limits other cores’ memory access rates to protect the memory
performance of the application until the critical section finishes.

From case studies with real-world soft real-time applications, we found that (1)
such memory-performance critical sections do exist and are often easy to identify;
and (2) applying BWLOCK to memory-performance critical sections significantly
improves the performance of the soft real-time applications at a small or no cost
in throughput of non-real-time applications.


1 Introduction 

In a multicore system, the performance of an application running on a core can be
significantly affected by other applications on different cores due to contention in
shared hardware resources such as the shared Last-Level Cache (LLC) and DRAM. When
the shared resources become bottlenecks, traditional CPU scheduling based techniques
such as raising priorities or using CPU reservation based approaches do not
necessarily improve the performance of the real-time applications.

In hard real-time systems, such as avionics systems, one solution adopted in the
industry has been to disable all but one core in the system [20] to completely eliminate
the shared resource contention problem, so that the system can be certified.¹ Another
approach, adopted in PikeOS, is a time partitioning technique in which only one
core is allowed to execute for a set of pre-defined time windows.

¹ The current standard for certification is designed for unicore systems [2].

In the context of soft real-time systems, on the other hand, a certain degree of
performance variation due to interference is often tolerable. Furthermore, modern
multicore architectures provide a significant amount of parallelism in the processor
(e.g., out-of-order core design) and in the memory subsystem (e.g., non-blocking
caches and multi-bank DRAM) that can absorb a considerable degree of concurrent
accesses without noticeable performance impact. Therefore, it is highly desirable
to develop a solution that can provide better real-time performance while still allowing
concurrent execution to leverage the full potential of multicore.

In our previous work, we developed a software based memory access control
system, called MemGuard, that allows concurrent memory accesses from multiple cores,
up to certain limits, by providing minimum memory bandwidth guarantees to each
core in the system. One problem of this approach is, however, that the reservable
amount of bandwidth is very small compared to the peak memory bandwidth. While
it tries to maximize performance via prediction based bandwidth reclaiming,
accurate prediction is challenging, especially for bursty memory access patterns, which
are commonly found in many soft real-time applications.

In this paper, we present BWLOCK, a user-level API and memory bandwidth
control mechanism to protect the performance of soft real-time applications such as
multimedia applications. Our key observation is that interference is most visible when
multiple cores have high memory demands at the same time. In such cases, all participating
cores will be delayed due to queueing and other issues that cannot be hidden by the
underlying hardware. Therefore, in order to protect the performance of real-time
applications, the system must avoid overload situations while the real-time applications
need high memory performance, that is, while they execute memory intensive regions of
the code. We call such a region a memory-performance critical section. Fortunately, in
many soft real-time applications, such as multimedia applications, such
memory-performance critical sections are often easy to identify via application level
profiling techniques. For example, using perf in Linux, one can identify functions that
have very high memory demands.

Motivated by these observations, BWLOCK provides a lock-like API with which
programmers can mark sections of code that are memory-performance critical.
When a memory-performance critical section is being executed, BWLOCK limits
the amount of memory traffic allowed from the other cores to avoid overloading the
memory system. In cases where modifying source code or profiling is not desired or
possible, BWLOCK also allows the entire execution of an application to be declared as
memory-performance critical so that whenever the application is scheduled on a CPU
core, its memory performance can be ensured. We call the former fine-grained
bandwidth locking and the latter coarse-grained bandwidth locking.

We applied BWLOCK to two real-world soft real-time applications, Mplayer and the
WebRTC framework (as part of the chromium-browser), to protect their real-time
performance in the presence of memory intensive non-real-time applications. In the case
of Mplayer, we achieve near perfect isolation for Mplayer at a cost of 17% throughput
reduction of the non-real-time applications in the coarse-grained mode, or achieve 17%
better real-time performance for Mplayer at the cost of only 7% throughput
reduction of the non-real-time applications in the fine-grained mode. Similar
improvements are observed for WebRTC as well.

Our contributions are as follows: 

• We propose an OS mechanism and API that can substantially improve the
performance of soft real-time applications in a multi-programmed environment such
as cloud systems.

• We present extensive evaluation results using real-world soft real-time
applications, demonstrating the viability and the practicality of the proposed approach.

The remaining sections are organized as follows: Section 2 provides background
on the software based memory access control technique and motivating experiments.
Section 3 presents the design and implementation of BWLOCK. Section 4 describes the
evaluation platform and the implementation overhead analysis. Section 5 presents
case study results using two real-world soft real-time applications. Section 6 discusses
limitations and possible improvements. We discuss related work in Section 7 and
conclude in Section 8.

2 Background and Motivation 

In our previous work, we proposed a software based memory bandwidth management
system called MemGuard in the Linux kernel. The key idea is to periodically monitor and
regulate the memory access rate of each core using per-core hardware performance
counters. If, for example, a group of tasks generates too much memory traffic and delays
the critical real-time tasks, MemGuard can regulate the memory access rates of the cores
running the offending tasks. With this regulation mechanism, MemGuard offers a bandwidth
reservation service that partitions a fraction of the available memory bandwidth among
the cores and ensures that the reserved bandwidth is guaranteed at all times. The
bandwidth reservation parameter is chosen statically for each core (although it can be
modified at run-time by the system administrator). Once the reservation parameter
is chosen, the primary goal of MemGuard is to guarantee the reserved bandwidth
of each core. Static partitioning, however, is inherently inefficient when the demands of
cores change over time, as unused bandwidth can be wasted. To minimize bandwidth
under-utilization due to static partitioning, MemGuard employs a prediction based
bandwidth reclaiming mechanism that dynamically re-distributes unused bandwidth
at runtime.

There are, however, a few issues when we apply MemGuard to improve the
performance of soft real-time applications. First, MemGuard reserves memory bandwidth
on a per-core basis. Therefore, when a core hosts different types of applications (with
different memory bandwidth demands), it is difficult to choose appropriate bandwidth
reservation parameters. Second, while this restriction is mitigated to a certain degree by
the runtime bandwidth re-distribution mechanism, the effectiveness of the mechanism
depends on its prediction accuracy, and accurate prediction is in general very
challenging. In particular, multimedia soft real-time applications, which we focus on in
this paper, often show bursty memory access patterns, which are difficult to predict
without application level information or a sophisticated learning algorithm, which in
turn is difficult to implement efficiently in the kernel.

Figure 1: Memory bandwidth demand changes over time of Mplayer and WebRTC.

Figure 2: Average memory access latency of a Latency benchmark as a function of the
memory bandwidth of each co-runner on three different cores.

Figure 1 shows the memory access patterns of two multimedia applications, Mplayer
and WebRTC, collected over a 10 second duration (sampled every 1 ms). As both
programs process video/audio frames at a regular interval, when processing a new
frame, they require high memory bandwidth for a short period of time, while at other
times their memory bandwidth demands are low as they are executing compute
intensive instructions or waiting for the next period.

When these soft real-time applications compete for memory bandwidth with other
applications running on different cores, the short sections of code that demand high
memory bandwidth can suffer a disproportionately high performance impact.
When the overall memory demand is low, memory access latencies can often be hidden
by a variety of latency hiding techniques (e.g., out-of-order execution) and the abundant
memory level parallelism in modern multicore architectures. However, when the
overall demand is beyond a certain point, such techniques are no longer able to hide
latencies and the requests pile up in various queues in the system, which substantially
slows down all tasks requesting memory. Figure 2 illustrates this phenomenon.
In this experiment, we measure the average memory access latency (normalized to
run-alone performance) of the Latency benchmark (a pointer chasing micro-benchmark)
while varying the memory bandwidth demand of co-runners on the other three
cores in a quad-core Intel Xeon system (the detailed hardware setup is described in
Section 4). When the co-runners’ bandwidth is low (100 MB/s), the performance impact
on the measured Latency benchmark is negligible. However, as the memory bandwidth
demand of the co-runners increases (100 - 900 MB/s), the observed delay of the Latency
benchmark increases exponentially and then saturates (after 900 MB/s).
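The text describes the Latency benchmark only as a pointer chasing micro-benchmark. The following is a minimal sketch of that general technique, not the benchmark used in the experiment. The names `build_chain` and `chase` and the 64-byte cache-line size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE_BYTES 64  /* assumed cache-line size */

/* One node per cache line, so every hop touches a different line. */
struct node {
    struct node *next;
    char pad[LINE_BYTES - sizeof(struct node *)];
};

/* Link n nodes into one random cycle. A shuffled visiting order keeps the
 * next address unpredictable, so each load depends on the previous one. */
static void build_chain(struct node *nodes, size_t n, unsigned seed)
{
    size_t *order = malloc(n * sizeof *order);
    assert(n > 0 && order != NULL);

    for (size_t i = 0; i < n; i++)
        order[i] = i;

    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    /* order[k] points to order[k + 1]; the last node closes the cycle. */
    for (size_t k = 0; k < n; k++)
        nodes[order[k]].next = &nodes[order[(k + 1) % n]];

    free(order);
}

/* Follow the chain for `hops` dependent loads and return the final node. */
static struct node *chase(struct node *start, size_t hops)
{
    struct node *p = start;
    while (hops--)
        p = p->next;
    return p;
}
```

Timing a long call to `chase` and dividing by the number of hops gives an average per-access latency, provided the array is much larger than the last-level cache so that most hops miss in the cache.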

In summary, our observations are as follows: (1) Soft real-time applications such as
multimedia applications often show bursty memory access patterns, regularly
requiring high memory bandwidth for a short duration of time. (2) Such a period, which
we call a memory-performance critical section, is often critical for timely data
processing, but it can be disproportionately delayed by bandwidth demanding co-runners
on different cores.

These observations motivate us to design a new memory access control system, 
BWLOCK, described in the next section. 


3 BWLOCK 

BWLOCK consists of user-level APIs and a kernel-level memory bandwidth control
mechanism, and is designed to improve the performance of soft real-time applications
(e.g., multimedia applications) on multicore systems. It provides simple lock-like
APIs that applications can call to express the importance of memory
performance for a given section of code (i.e., a memory-performance critical section).
Once an application acquires the lock, which we call a memory bandwidth lock, the
kernel-level memory bandwidth control system allows unlimited memory accesses to the
requesting task while regulating the maximum allowed memory bandwidth of the other
cores to avoid excess bandwidth contention, which could delay the task running the
memory-performance critical section.

3.1 System Architecture 

Figure 3 shows the overall architecture of the proposed system. At the user-level, we
provide two APIs, bw_lock() and bw_unlock(), to protect memory-performance
critical sections. When bw_(un)lock() is called, the kernel updates the
calling process’s state so that whenever the CPU scheduler schedules the task, the kernel
can determine whether the task is executing a memory-performance critical section or
not. Instead of modifying the code, external utilities can also set the bandwidth locks of
other processes in the system. The per-core bandwidth regulators are activated when
one or more cores are executing memory-performance critical sections. In our current
implementation, the check is performed periodically (e.g., every 1 ms) by software
based bandwidth regulators. Ideally, however, hardware assisted mechanisms could
support more fine-grained memory access control (see Section 6 for a discussion of
potential hardware support).
Figure 3: Overall system architecture of BWLOCK.


API           Description

bw_lock()     begin a memory-performance critical section
bw_unlock()   end a memory-performance critical section


Table 1: BWLOCK user-level APIs


3.2 Design and Implementation 

BWLOCK supports fine-grained and coarse-grained bandwidth locking. In fine-grained
mode, programmers are required to use the APIs in Table 1 to declare memory-performance
critical sections. This allows fine-grained control over memory performance but requires
detailed profiling information to be effective. Often, such profiling information can
easily be obtained using publicly available tools such as perf in Linux, as we will show
in our case studies in Section 5. The coarse-grained mode is equivalent to calling
bw_lock once, by the program itself or by an external utility, and never releasing it.
Then, whenever the process is scheduled, it automatically holds the bandwidth lock.

We provide an external tool to set the bandwidth lock of any existing process in the
system. Therefore, BWLOCK can be applied to unmodified programs, although the
granularity of control is then the entire duration for which the task occupies a CPU
core. It is important to note that unlike traditional locks used for synchronization, in
which only one task can acquire a lock, a bandwidth lock can be held by multiple tasks,
perhaps on different cores, at any given time. In other words, if multiple soft real-time
applications request a bandwidth lock, all of them will be granted
the bandwidth lock. We chose this design because the primary goal of BWLOCK is to
protect soft real-time applications from memory intensive non-real-time applications. In
a sense, our design is a two-level priority system that prioritizes real-time tasks over
non-real-time tasks in accessing memory. It can, however, be naturally extended to
support multiple levels of priorities in accessing memory in the future. For example,
instead of allowing unlimited memory accesses, the task which holds a bandwidth lock
could also be regulated depending on a priority value associated with the bandwidth lock.
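The two-level policy and the fact that several tasks may hold the lock at once can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: `core_state`, `count_lock_holders`, `assign_budgets`, and `BUDGET_UNLIMITED` are hypothetical names, and budgets are expressed as plain access counts.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BUDGET_UNLIMITED ULONG_MAX  /* stands in for "not regulated" */

/* Hypothetical per-core view of the task that is currently running. */
struct core_state {
    int bwlock_val;        /* > 0 while the running task holds a bandwidth lock */
    unsigned long budget;  /* memory accesses allowed in the next period */
};

/* Count the cores whose current task holds a bandwidth lock. */
static size_t count_lock_holders(const struct core_state *cores, size_t n)
{
    size_t holders = 0;
    for (size_t i = 0; i < n; i++)
        if (cores[i].bwlock_val > 0)
            holders++;
    return holders;
}

/* Two-level policy: if at least one core holds the lock, every holder is
 * unregulated and every other core gets min_budget. If nobody holds the
 * lock, no core is regulated. */
static void assign_budgets(struct core_state *cores, size_t n,
                           unsigned long min_budget)
{
    int locked = count_lock_holders(cores, n) > 0;

    for (size_t i = 0; i < n; i++) {
        if (locked && cores[i].bwlock_val <= 0)
            cores[i].budget = min_budget;
        else
            cores[i].budget = BUDGET_UNLIMITED;
    }
}
```

In this model two lock holders on different cores are both left unregulated, which matches the description that a bandwidth lock is not exclusive. A multi-level extension would replace the single `min_budget` with a value chosen per priority.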






































Figure 4 shows the kernel-level implementation of BWLOCK. We added an
integer value bwlock_val to the process control block structure of Linux (task_struct)
to indicate the status of BWLOCK. The value can be updated via a
system call (see syscall_bwlock()). Since it simply updates an integer value and
nothing else, its calling overhead is very small (the overhead analysis is given in
Section 4.2). Each core’s bandwidth regulator (see per_core_period_handler())
periodically checks how many cores are executing memory-performance critical
sections (i.e., the task’s bwlock_val > 0). If one or more cores are executing
memory-performance critical sections, only the cores that hold the bandwidth lock can
access memory freely (maxperf_budget is an infinite value) while the others are
regulated according to minperf_budget. Note that minperf_budget is a
system parameter that indicates the maximum amount of memory traffic that can
co-exist without significant performance interference. In our current implementation it
is 100 MB/s. If a PMC (performance monitoring counter) overflow interrupt occurs,
because the core has exhausted its bandwidth budget, the core is immediately throttled
by scheduling a high priority real-time kernel thread (kthrottle). The throttled core is
re-activated at the beginning of the next period by the period handler.
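Because the regulator programs a counter rather than a rate, a bandwidth limit has to be converted into a number of counted memory accesses per period. The sketch below shows that arithmetic under assumptions the text does not state: each counted event transfers one 64-byte cache line, and 1 MB is 10^6 bytes. The function name `budget_events` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a bandwidth limit into counted events per regulation period.
 *
 * bytes per period = (mb_per_s * 10^6 B/s) * (period_us / 10^6 s)
 *                  = mb_per_s * period_us
 *
 * Dividing by the assumed bytes per event gives the counter threshold. */
static uint64_t budget_events(uint64_t mb_per_s, uint64_t period_us,
                              uint64_t line_bytes)
{
    assert(line_bytes > 0);
    return (mb_per_s * period_us) / line_bytes;
}
```

Under these assumptions, a 100 MB/s limit with a 1 ms period corresponds to `budget_events(100, 1000, 64)`, which is 1562 events per period.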


4 Evaluation Setup 

In this section, we present details on the hardware platform and the BWLOCK software 
implementation. We also provide detailed overhead analysis and discuss performance 
trade-offs. 

4.1 Hardware Platform 

We use a quad-core Intel Xeon W3530 based desktop computer as our testbed. The
processor has private 32K-I/32K-D (4/8 way) L1 caches and a private 256 KiB (8 way)
L2 cache for each core, and a shared 8 MiB (16 way) L3 cache. The memory
controller (MC) is integrated in the processor and connected to a 4 GiB 1066 MHz DDR3
memory module. The graphics card is an NVIDIA GeForce 8400. We disabled
turbo-boost, dynamic power management, and hardware prefetchers for better
performance predictability and repeatability.

4.2 Implementation Details and Overhead Analysis 

We implemented BWLOCK in Linux version 3.6.² The kernel’s task_struct is
modified as shown in Figure 4. For memory bandwidth control, BWLOCK uses a
modified version of the MemGuard kernel module.

There are two major sources of overhead in BWLOCK: system calls and interrupt
handling. First, in the fine-grained setting, two system calls are required for each
memory-performance critical section. In our current implementation, a single system
call is used to implement both bw_lock() and bw_unlock(). The system-call overhead is
small: 125.24 ns on average (over 10,000 iterations), as it simply changes a single
integer value in the task’s task_struct.

Second, in our current implementation, a periodic timer handler is used to monitor
which cores hold the bandwidth lock, as shown in Figure 4. The actual
access control is performed by a performance counter overflow interrupt handler.
Although the overflow handler is not in the critical path of normal program execution,
the periodic timer interrupt is pure overhead that is added to the task’s execution time,
just like the OS tick timer handler in standard operating systems. We quantified the
periodic interrupt handling overhead by measuring the increase in execution time of a
benchmark. Table 2 shows the measured overhead (i.e., the percentage increase in
execution time) under different period lengths. Based on this result, we use a 1 ms
period unless noted otherwise.

² BWLOCK will be publicly available at https://github.com/heechul/bwlock


// task structure
struct task_struct {
    /* ... existing fields ... */
    int bwlock_val;   // 1 - locked, 0 - unlocked
};

// bwlock system call
int syscall_bwlock(pid_t pid, int val)
{
    struct task_struct *p;
    if (pid == 0)
        p = current;   // 'current' <- calling task
    else
        p = find_process_by_pid(pid);
    p->bwlock_val = val;
    return 0;
}

// periodic handler called by the
// bandwidth regulators
void per_core_period_handler()
{
    // re-activate the suspended core
    if (current == kthrottle)
        deschedule(kthrottle);

    if (nr_bwlocked_cores() > 0) {
        // one or more cores requested bwlock
        if (current->bwlock_val > 0)
            budget = maxperf_budget;
        else
            budget = minperf_budget;
    } else {
        // no cores requested bwlock
        budget = maxperf_budget;
    }

    // program the core's performance counter
    // to overflow at 'budget' memory accesses
}

// PMC overflow handler
void per_core_overflow_handler()
{
    // stall the core till the next period
    // kthrottle <- high priority idle thread
    schedule(kthrottle);
}

Figure 4: BWLOCK kernel implementation


Period (us)   Overhead (%)

100           3.5
250           1.5
500           0.9
1000          0.7
2500          0.5

Table 2: Periodic interrupt handling overhead


5 Evaluation Results 

In this section, we present case-study results using two real-world soft real-time
applications, Mplayer (a video player) and WebRTC (a multimedia real-time communication
framework for browser based web applications), to evaluate the effectiveness of BWLOCK.

5.1 Mplayer 

Mplayer is a widely used open-source video player. In the following set of
experiments, our goal is to protect the real-time performance of the Mplayer instance(s)
in the presence of memory intensive co-running applications while still maximizing the
overall throughput of the co-runners.

In the first set of experiments, one Mplayer instance plays an H.264 movie clip with
a frame resolution of 1920x816 and a frame rate of 24 fps. We slightly modified the
source code of Mplayer to obtain the per-frame processing time and other statistics.
Decoded video frames are displayed on screen via a standard X11 server process.
Therefore, both Mplayer and X11 have soft real-time characteristics.

5.1.1 Profiling 

To understand their memory-performance characteristics, we collect function level
profiling information (cache-misses and cycles of each function) with the perf tool,
which uses hardware performance counters. The profiled information of Mplayer and
X11 is shown in Table 3 and Table 4, respectively. In both cases, the functions that
generate most of the memory traffic were identified: yuv420_rgb32_MMX in Mplayer and
sse2_blt in X11. Note that each function is responsible for more than 50% of the
total LLC misses of its respective application, while it is responsible for a much
smaller share of CPU cycles (27.8% and 32.85%, respectively). Therefore, they are prime
candidates for applying BWLOCK. Due to a restriction of our current implementation,
namely its reliance on periodic bandwidth regulation, it is also important to know the
duration of each function; if it is too short, BWLOCK may not be able to regulate
co-runners’ memory accesses when needed. Table 5 shows the average and 99th percentile
execution times of the functions. Fortunately, both functions are long enough to be
regulated by the bandwidth control mechanism of BWLOCK.
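The dependence on function duration can be made concrete with a rough bound. The sketch below relies on a model that is an assumption here, not something stated in the text: budgets change only at period boundaries, so a critical section that starts just after a boundary may run unprotected for up to one full period. Under that assumption, at least (D - P) / D of a section of length D is covered when the period is P. The function name is hypothetical.

```c
#include <assert.h>

/* Worst-case fraction of a critical section that runs while co-runners
 * are regulated, assuming regulation starts at most one period after the
 * section begins. Returns 0 when the section fits within one period. */
static double worst_case_covered_fraction(double duration_us, double period_us)
{
    assert(duration_us > 0.0 && period_us > 0.0);
    if (duration_us <= period_us)
        return 0.0;
    return (duration_us - period_us) / duration_us;
}
```

For a 2.9 ms section and a 1 ms period, the bound is 1.9 / 2.9, roughly 0.66. This is a worst case; a section that starts close to the next period boundary is covered almost entirely.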


LLC misses   Cycles   Function

51.6%        27.8%    yuv420_rgb32_MMX
18.8%        9.3%     prefetch_mmx2
4.5%         7.3%     hl_decode_mb_simple_8


Table 3: Profiled information of Mplayer


LLC misses   Cycles   Function

53.29%       32.85%   sse2_blt
24.13%       24.19%   fbBlt
14.10%       19.61%   sse2_composite_over_8888_8888


Table 4: Profiled information of X11


Average duration   99 pct. duration   Function           Application

2.9 ms             4.2 ms             yuv420_rgb32_MMX   Mplayer
1.1 ms             2.9 ms             sse2_blt           X11


Table 5: Timing statistics of memory intensive functions


5.1.2 Performance comparison 

To investigate the effectiveness of BWLOCK, we conducted a set of experiments. We
first run Mplayer alone (with the X server) to get the baseline performance. In
order to generate memory interference, we use two instances of a memory intensive
synthetic benchmark, referred to as bw_write. We also measure their baseline
performance in isolation. We then co-schedule all four processes (Mplayer, X11, and two
bw_write instances) at the same time in four different configurations. For convenience
of monitoring and measurement, each process is assigned to a dedicated core using the
CPU affinity facility in Linux. Note that all four processes are single-threaded. In
Default, we use a standard vanilla Linux kernel. In MemGuard, we use MemGuard:
the memory bandwidth budgets are configured as 450, 450, 100, and 100 MB/s for
Core0 to Core3, respectively, and predictive bandwidth re-distribution is enabled; note
that Mplayer (Core0) and X11 (Core1) are reserved more bandwidth than the co-runners.
In BWLOCK(fine), we manually insert bw_lock and bw_unlock in the previously
identified memory intensive functions of Mplayer and X11 (Table 5), as shown in
Figure 5. Lastly, in BWLOCK(coarse), both Mplayer and X11 are not modified but are
configured to automatically hold the bandwidth lock whenever they are scheduled.


static inline int yuv420_rgb32_MMX
    (SwsContext *c, const uint8_t *src[], ...)
{
    bw_lock();     // added

    YUV2RGB_LOOP(4)

    bw_unlock();   // added
}

Figure 5: Code modification example for fine-grained application of BWLOCK.


Figure 6: Normalized performance of Mplayer (average frame time) and co-running
Bandwidth benchmarks (MB/s): 1xMplayer and 2xBandwidth instances.

Figure 6 shows the results. For Mplayer, performance is measured by the average
frame processing time. For the co-running bw_write
benchmarks, performance is measured by the aggregate throughput (MB/s) of the two.
In the figure, performance is normalized to each application’s baseline performance
measured in isolation. In Default, Mplayer’s performance suffers significantly
(dropping by 51%) due to memory contention with the co-running bw_write instances,
which are much less affected (dropping by 22%). This kind of disproportionate
performance impact is common in COTS multicore systems and is caused by a
combination of application memory characteristics and the DRAM controller’s scheduling
policy [25, 19]. In MemGuard, Mplayer’s performance is better protected (dropping by
32%) as more memory bandwidth is reserved for it. However, this comes at the cost of a
considerable performance reduction for the co-runners, which reach only 51% of the
baseline performance. In BWLOCK(fine), on the other hand, both Mplayer’s and the
co-runners’ performance are improved over MemGuard, by 12% for Mplayer and 13% for the
co-runners. This is because the memory-performance critical sections in Mplayer,
identified by profiling, are protected from interference by the co-runners’ memory
accesses using the explicit bw_lock and bw_unlock calls. Lastly, in BWLOCK(coarse),
Mplayer is unmodified, but whenever it is scheduled by the CPU scheduler, it
automatically holds the bandwidth lock. As a result, Mplayer’s performance is almost
identical to the baseline performance. However, because the entire duration of Mplayer’s
processing is protected by the bandwidth lock, even when it does not access memory, the
co-runners’ performance is slightly further degraded.

Figure 7: Per-core memory access patterns.

Figure 8: Frame processing time comparison.

Figure 7 shows the memory access pattern of each core. The y-axis shows the
number of LLC misses of each core for every one millisecond period. Note that Core2
and Core3 have a constant memory demand when they run in isolation. In Default,
whenever Mplayer and/or X11 begin processing and demand high memory bandwidth,
all tasks suffer considerable bandwidth contention. In MemGuard, we can observe
that Mplayer (and X11) gets more bandwidth than bw_write when needed.
However, because MemGuard relies on predictions of future usage, which are difficult
to make accurately, its demand is not always satisfied. In both BWLOCK(fine) and
BWLOCK(coarse), on the other hand, we can observe that the co-runners are regulated
immediately when Mplayer’s memory demands arrive; hence Mplayer achieves
performance nearly identical to its baseline in isolation.

Figure 8 shows the frame processing time in different system configurations. Note that
Solo represents Mplayer’s baseline performance measured in isolation. BWLOCK(coarse)
mostly overlaps with Solo. BWLOCK(fine) and MemGuard take longer to process
frames, and Default, as expected, takes the longest for most frames.

5.1.3 Overloaded System 

So far, we have assigned one task per core, and neither Mplayer nor X11 consumes
100% of the cycles of its assigned core. In other words, the system is under-utilized.
In order to investigate how BWLOCK performs in an overloaded system, we performed
another set of experiments in which each core runs an Mplayer and a bw_write instance
(i.e., four Mplayer instances and four bw_write instances) to fully load the system. The
performance metrics are the same: the average frame processing time of Mplayer and the
aggregate bandwidth of bw_write. Figure 9 shows the results. Notice that, in this
experiment setup, all cores run both real-time and non-real-time tasks. Therefore,
MemGuard’s core-based bandwidth partitioning, which prioritizes certain cores over the
others, is not appropriate. Hence, we only compare the results of Default and the two
BWLOCK settings (fine and coarse). As shown in the figure, both BWLOCK settings provide
good performance isolation for the Mplayer instances at the cost of more degraded
performance for the co-runners, which do not request bandwidth locks. Note that our
current BWLOCK implementation does not limit the number of tasks that can hold
bandwidth locks at a given time. Therefore, memory contention among the soft real-time
tasks that hold bandwidth locks on different cores could potentially cause them to
delay each other. The performance reduction of Mplayer in BWLOCK(coarse)
is not due to contention from the bw_write instances but is entirely due to the
co-running Mplayer instances; we verified this by comparing it with the result obtained
by running only four instances of Mplayer without the bw_write instances.

Figure 9: Normalized performance of Mplayer (average frame time) and co-running
Bandwidth benchmarks (MB/s): 4xMplayer and 4xBandwidth instances.

5.2 WebRTC 

WebRTC is an open source, plug-in free, RTC (real-time communication) platform
for enabling audiovisual, network-based applications between browsers. The goal of
this experiment is to provide real-time performance isolation to WebRTC sessions in
the presence of memory intensive co-running applications on multicore platforms. We
also investigate the side effects of different isolation mechanisms on the performance of
co-runners. The setup is configured to achieve negligible congestion in the network by
having the two communicating hosts directly connected through a Gigabit Ethernet switch.
Hence, the performance variability observed is entirely due to resource contention
in the host itself. WebRTC uses the GCC (Google Congestion Control) algorithm to
derive the target bit-rate of audiovisual streams based on resource contention in the
network and the end hosts. The frame rate and sending bandwidth are adjusted to match the
available resources at any given time. The default resolution of 640x480 and frame
rate of 30 FPS are used for experimentation, while the threshold bandwidth is increased
to 4 Mb/s from the default 2 Mb/s. LBM benchmarks from the SPEC2006 suite are chosen
as co-running applications. Since the X11 server is the front end of WebRTC, they are
considered together as a group and assigned to share CPU cores in one cgroup, while the
LBM co-runners are allocated to the remaining two CPUs, which belong to another cgroup.

Figure 10: Normalized performance of WebRTC (average bandwidth) and co-running
LBM benchmarks (MB/s)

5.2.1 Profiling 

Similar to MPlayer, to understand the memory access pattern, we collected
function-level profiling information for WebRTC using the Linux perf tool. This time
we focused only on cache-miss events. The functions sk_memset32_SSE2 and
S32A_Opaque_BlitRow32_SSE2 from the Skia library account for more than 50%
(29.29% and 22.95%, respectively) of the cache misses during a WebRTC session. The
mean execution length of these functions is 7.5 us, and more than 99% of the samples
are shorter than 100 us. These execution lengths are much shorter than those observed
for the profiled MPlayer functions. At first glance, these functions do not seem ideal
for fine-grained BWLOCK, as the minimum BWLOCK period is 1 ms. However, we think
that higher-level subroutines invoke these low-level graphics functions in large
numbers. Bursty invocation of these functions might produce aggregated, continuous
active periods on the order of the BWLOCK regulation period. We therefore applied
fine-grained BWLOCK to these two functions to understand the effects of fine-grained
memory bandwidth regulation, and compared its performance with that of other
isolation techniques, namely coarse-grained BWLOCK, MemGuard, and Default
(CPUSET partitioning). As with MPlayer, we inserted entry and exit calls (bw_lock
and bw_unlock) around the functions. While the lock is held, sufficient memory
bandwidth (1000 MB/s) is reserved for the corresponding CPU cores, and the bandwidth
quota of the other cores is set to 100 MB/s.
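
The quota policy used in this experiment can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: the real bw_lock and bw_unlock
calls are backed by the kernel, whereas the class BandwidthLockModel, its quota
method, and the per-core nesting counter are hypothetical names introduced here for
illustration. The 1000 MB/s and 100 MB/s values are the ones quoted in the text, and
the model assumes that cores are unregulated when no lock is held.

```python
from contextlib import contextmanager

LOCKED_QUOTA_MBPS = 1000     # quota of a core that holds the lock (from the text)
THROTTLED_QUOTA_MBPS = 100   # quota of every other core while a lock is held
UNLIMITED = float("inf")     # assumption: no regulation when nobody holds a lock


class BandwidthLockModel:
    """Toy model of the quota policy; not the kernel implementation."""

    def __init__(self, num_cores):
        # Hypothetical per-core nesting counter (0 means "lock not held").
        self.holders = [0] * num_cores

    def bw_lock(self, core):
        self.holders[core] += 1

    def bw_unlock(self, core):
        if self.holders[core] == 0:
            raise RuntimeError("bw_unlock without a matching bw_lock")
        self.holders[core] -= 1

    def quota(self, core):
        """Bandwidth quota (MB/s) that would apply to `core` right now."""
        if not any(self.holders):
            return UNLIMITED
        if self.holders[core]:
            return LOCKED_QUOTA_MBPS
        return THROTTLED_QUOTA_MBPS

    @contextmanager
    def critical_section(self, core):
        """Bracket a memory-performance critical section with lock/unlock."""
        self.bw_lock(core)
        try:
            yield
        finally:
            self.bw_unlock(core)


model = BandwidthLockModel(num_cores=4)
with model.critical_section(core=0):          # e.g. around one blit call
    print(model.quota(0), model.quota(1))     # prints "1000 100"
print(model.quota(1))                         # prints "inf"
```

The context manager mirrors how the entry and exit calls bracket each profiled
function, so an early return or exception inside the section cannot leave the other
cores throttled.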

5.2.2 Performance 

Figure 10 shows the normalized performance of WebRTC and the co-running LBM
instances under different isolation mechanisms. Coarse-grained BWLOCK achieves
near-perfect performance isolation for WebRTC from the co-running LBM tasks, albeit
with a heavy penalty for the co-runners. The WebRTC process consists of ~20 threads,
of which only a couple are involved in encoding and decoding video. Hence, the
coarse-grained mechanism over-reserves bandwidth for the WebRTC process, leaving
very little spare bandwidth for the co-runners. With fine-grained BWLOCK (using
bw_lock and bw_unlock) on the profiled graphics functions, the performance of WebRTC
improves without much penalty to the co-running tasks. Since the two profiled
functions contribute only around 50% of the cache misses, perfect isolation is not
achieved. At the same time, the many non-core threads (threads not involved in
encoding and decoding video) receive no bandwidth reservation, which leaves
sufficient room for the LBM co-runners. Some performance penalty is incurred because
the very short execution time of the profiled functions leads to increased system
call overhead. Using MemGuard in reclaim and spare-sharing mode, we could achieve
perfect isolation for WebRTC performance, with a penalty of more than 50% for the
co-runners. In comparison to MemGuard, BWLOCK is a dynamic, on-demand mechanism,
whereas MemGuard requires a static, predetermined per-core bandwidth allocation.
All the approaches achieve better real-time WebRTC performance than Default
(CPUSET partitioning alone).

Table 6 shows the important metrics of WebRTC. The results correspond to the
average bandwidth achieved by WebRTC in specific scenarios. Except for Default
(CPUSET partitioning alone) and fine-grained BWLOCK, all configurations provide
complete isolation to WebRTC from the co-runners, albeit with varying degrees of
penalty on the co-running applications. A clear trade-off emerges: Default,
BWLOCK(fine), MemGuard, and BWLOCK(coarse) provide increasing levels of isolation
to WebRTC at an increasing penalty for the co-runners. When GCC kicks in during
resource contention, it dynamically adapts the stream, reducing the bandwidth and/or
the frame rate. These parameters together determine the achieved audiovisual
quality.


Config.           RTT (ms)   FR (FPS)   BW (kb/s)
Default              17.20      21.34     2917.85
BWLOCK(fine)          4.22      29.58     2229.10
MemGuard              2.24      29.98     4019.16
BWLOCK(coarse)        2.22      30.00     4025.30


Table 6: WebRTC internal performance metrics 
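
To make the trade-off in Table 6 easier to compare, the metrics can be expressed
relative to the best-isolated configuration. The short script below is an illustrative
calculation only: the numbers are copied from Table 6, the third column is read as
frame rate in frames per second, and the choice of BWLOCK(coarse) as the reference
and the function name summarize are assumptions made for this sketch.

```python
# Values copied from Table 6: (RTT in ms, frame rate in FPS, bandwidth in kb/s).
TABLE6 = {
    "Default":        (17.20, 21.34, 2917.85),
    "BWLOCK(fine)":   (4.22, 29.58, 2229.10),
    "MemGuard":       (2.24, 29.98, 4019.16),
    "BWLOCK(coarse)": (2.22, 30.00, 4025.30),
}
TARGET_FPS = 30.0  # frame rate configured for the experiment


def summarize(table, reference="BWLOCK(coarse)"):
    """Express each configuration relative to the reference configuration."""
    ref_rtt, _, ref_bw = table[reference]
    summary = {}
    for name, (rtt, fps, bw) in table.items():
        summary[name] = {
            "rtt_ratio": rtt / ref_rtt,        # >1 means slower round trips
            "fps_fraction": fps / TARGET_FPS,  # share of the 30 FPS target
            "bw_fraction": bw / ref_bw,        # share of the reference bandwidth
        }
    return summary


for name, row in summarize(TABLE6).items():
    print(f"{name:15s} RTT x{row['rtt_ratio']:.2f}  "
          f"FPS {row['fps_fraction']:.0%}  BW {row['bw_fraction']:.0%}")
```

Read this way, the round-trip time under Default is roughly 7.7 times that of
BWLOCK(coarse), and its frame rate is about 71% of the 30 FPS target.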


6 Discussion 

In this section, we discuss limitations of our approach and future improvements. 

6.1 Hardware Assisted Memory Bandwidth Control 

A significant limitation of our current approach is its software-based periodic
monitoring and bandwidth control mechanism, whose control granularity is limited to
the millisecond range because of interrupt handling overhead. This means that
detecting and applying a bandwidth lock can be delayed by up to one timer period.
While this may not be a serious issue in many soft real-time applications, as we have
shown in this paper, there may be other applications that cannot tolerate such a
delay. This limitation can easily be overcome with hardware support in the memory
controller or the CPU. For example, hardware can expose to the kernel a set of
registers that control the memory access priorities in the DRAM controller or the
size of the MSHR in the shared cache. BWLOCK can then simply update these registers
to protect memory-performance critical sections.
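
The effect of the millisecond control granularity can be illustrated with a simple
timing model. The sketch below assumes that a lock request only takes effect at the
next period boundary and that unlocking is instantaneous; both are simplifications
made for illustration, and the function names are hypothetical. The 1 ms period and
the 7.5 us section length are taken from Section 5.2.1.

```python
import math


def enforcement_time(request_ms, period_ms=1.0):
    """Time at which a lock requested at request_ms starts to be enforced.

    Assumption: the regulator only acts at period boundaries (timer ticks).
    """
    return math.ceil(request_ms / period_ms) * period_ms


def protected_fraction(start_ms, length_ms, period_ms=1.0):
    """Fraction of a critical section that runs with the lock enforced."""
    end_ms = start_ms + length_ms
    protected = end_ms - max(start_ms, enforcement_time(start_ms, period_ms))
    return max(0.0, protected) / length_ms


# A 2 ms section starting mid-period loses its first 0.5 ms of protection.
print(protected_fraction(start_ms=0.5, length_ms=2.0))     # prints 0.75

# A single 7.5 us call finishes before the next tick and is never protected.
print(protected_fraction(start_ms=0.3, length_ms=0.0075))  # prints 0.0
```

Under this model an isolated short call gains nothing from the lock, which is why
bursts of such calls, or hardware mechanisms that react faster than a timer tick,
matter for fine-grained use.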

6.2 Application to Hard Real-Time Systems 

Although we focus on soft real-time applications, we believe BWLOCK can also be
applied to hard real-time systems in some cases. For example, it is possible to
designate a single core to execute all hard real-time applications while the other
cores execute non-real-time applications. In such a scenario, we can apply BWLOCK to
all hard real-time tasks on the designated core so that, while any hard real-time
task executes, the maximum memory bandwidth usage of every other core is limited to
a fixed value. Non-real-time tasks and hard real-time tasks can then safely coexist
without excessive memory contention. In particular, with the hardware support
mentioned earlier, such a design can be used for systems that need certification.

7 Related Work 

OS-level memory access control was first discussed in the literature by Bellosa. The
basic idea is to reserve a fraction of memory bandwidth for each core [30, 31] (or
task [13]) by means of software mechanisms, e.g., a TLB handler or hardware
performance counter interrupts. One problem with the memory bandwidth reservation
approach is that, because memory bandwidth is partitioned among the cores (or tasks),
usable bandwidth can be substantially wasted if a core (or task) does not fully use
its reserved bandwidth. The work in [31] partly solves the problem by supporting
dynamic reclaiming and sharing, which redistributes memory bandwidth from cores that
under-utilize their reserved bandwidth to cores that need more than their
reserved bandwidth.
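
The difference between static reservation and reservation with reclaiming can be
sketched for a single regulation period. The function below is not the algorithm of
any cited system; it is a minimal illustration that assumes the per-period demand of
each core is known exactly (a real system has to predict it) and that spare bandwidth
is handed out in core order. The name reclaim_and_share is hypothetical.

```python
def reclaim_and_share(reserved, demand):
    """Grant bandwidth for one period, redistributing unused reservations.

    reserved[i]: bandwidth reserved for core i (e.g. in MB/s)
    demand[i]:   bandwidth core i would like to use in this period
    Returns (grants, leftover_spare).
    """
    # Step 1: every core first receives at most its own reservation.
    grants = [min(r, d) for r, d in zip(reserved, demand)]

    # Step 2: whatever a core did not need goes into a shared spare pool.
    spare = sum(r - g for r, g in zip(reserved, grants))

    # Step 3: cores that want more than their reservation draw from the pool.
    for i, d in enumerate(demand):
        extra = min(d - grants[i], spare)
        grants[i] += extra
        spare -= extra
    return grants, spare


reserved = [500, 500, 100, 100]
demand = [100, 900, 50, 300]
print(reclaim_and_share(reserved, demand))   # prints ([100, 900, 50, 150], 0)
```

With purely static reservations the same example would leave 450 units unused
(400 on core 0 and 50 on core 2) while cores 1 and 3 are throttled at their
reservations.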

However, the effectiveness of these techniques depends on the cores' memory access
patterns and the accuracy of future usage predictions. In general, memory bandwidth
reservation systems are not ideal for efficiently utilizing the available memory
bandwidth, which is essential in many soft real-time systems where real-time
applications are co-scheduled with non-real-time applications, because the bandwidth
reserved for certain real-time tasks results in under-utilization of the memory
subsystem. In contrast, BWLOCK allows unrestricted memory accesses most of the time,
and hence leverages the full benefits of the parallelism available in modern
multicore architectures. It limits excessive concurrent memory accesses from
non-real-time tasks only when they would likely affect the performance of soft
real-time tasks that are executing memory-performance critical sections. We find
that this selective regulation utilizes memory bandwidth more efficiently than
reservation-based approaches while still providing good real-time performance.

In the context of providing performance isolation in multicore systems, a
software-based cache partitioning technique known as page coloring has been
extensively studied [22, 17, 24, 28]. The basic idea is to allocate memory pages at
particular physical addresses so that each core accesses a different subset of the
cache sets. This way, the cache can be effectively partitioned without special cache
hardware. A downside of this approach, however, is that changing the partition size
at runtime is very costly. More recently, page coloring has been applied to partition
DRAM banks [23, 27, 29]. As with bandwidth partitioning, however, these
space-partitioned resources (cache and DRAM bank space) can be wasted if the cores or
tasks they are reserved for do not use them. Nevertheless, these space partitioning
techniques can reduce the degree of interference experienced by concurrent tasks and
are orthogonal to our approach.
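
The idea behind page coloring can be shown with a short calculation. The sketch
assumes a physically indexed cache whose set index is taken directly from the
physical address (no hashed or sliced indexing, which many recent processors use)
and 4 KiB pages; the function names and the example cache geometry are illustrative
and not taken from the evaluated platform.

```python
PAGE_BYTES = 4096  # assumed page size


def num_colors(cache_bytes, ways, page_bytes=PAGE_BYTES):
    """Number of page colors: one way of the cache, measured in pages."""
    return cache_bytes // (ways * page_bytes)


def page_color(phys_addr, cache_bytes, ways, page_bytes=PAGE_BYTES):
    """Color of the physical page containing phys_addr.

    Pages with the same color map to the same group of cache sets, so giving
    each core pages of disjoint colors partitions the cache between cores.
    """
    page_frame = phys_addr // page_bytes
    return page_frame % num_colors(cache_bytes, ways, page_bytes)


def colors_for_core(core, num_cores, total_colors):
    """Evenly split the colors; core k gets one contiguous block of them."""
    per_core = total_colors // num_cores
    return range(core * per_core, (core + 1) * per_core)


# Example geometry (assumed): 2 MiB, 16-way shared cache, 4 cores.
total = num_colors(2 * 1024 * 1024, ways=16)
print(total)                                            # prints 32
print(page_color(0x12345000, 2 * 1024 * 1024, 16))      # prints 5
print(list(colors_for_core(1, 4, total)))               # colors 8..15
```

Because the color of a page is fixed by its physical address, resizing a partition
means moving pages to frames of a different color, which is the runtime cost
mentioned above.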

There have been many hardware proposals that allow communication between the
system software (OS) and the hardware to make better resource scheduling and
allocation decisions. For example, many DRAM controller design proposals allow the
OS to set per-core priorities for memory request scheduling [26]. More recently,
Intel's new Xeon architecture has begun to expose shared resource allocation
interfaces to the OS. They are currently restricted to partitioning the LLC, but the
interface is generic and can support controlling other shared resources such as DRAM.
Such hardware support can be especially useful for BWLOCK because the software-based
periodic bandwidth control mechanism can be replaced by more efficient hardware
mechanisms with lower overhead.


8 Conclusion 

We have presented BWLOCK, a user-level API and kernel-level memory bandwidth
control mechanism designed to protect the performance of soft real-time applications
such as multimedia applications. It provides simple lock-like APIs to declare
memory-performance critical sections in the application code. When an application
enters a memory-performance critical section, BWLOCK automatically regulates the
other cores so that they cannot cause excessive memory interference.

We applied BWLOCK to two real-world soft real-time applications, Mplayer and the
WebRTC framework, to protect their real-time performance in the presence of
memory-intensive non-real-time applications that share the same machine. In both
cases, we were able to achieve near-perfect real-time performance, or to choose
less-than-perfect real-time performance (still better than vanilla Linux) in exchange
for minimal throughput reductions of non-real-time applications.

Our future work includes hardware assisted bandwidth control for better control 
quality and compiler based automatic identification of memory-performance critical 
sections in soft real-time applications. 